t-SNE: Visualizing High-Dimensional Data (Preserve Friendships, Not Geography)#
t-SNE (t-distributed Stochastic Neighbor Embedding) is a visualization tool: it tries to place points in 2D/3D so that nearby points in the original space stay nearby.
A useful mental model:
People standing close at a party: you can’t keep everyone’s exact GPS position when you redraw the room, but you can keep friend groups clustered.
Preserving friendships, not geography: t-SNE cares about who counts as a neighbor, not about the global map scale.
This notebook builds intuition first, then shows the probabilistic objective, then runs a small t-SNE optimizer (from scratch) so we can see optimization happening.
Learning goals#
By the end you should be able to:
explain t-SNE as “probabilities over neighbors”
interpret perplexity as a knob for “how many neighbors matter”
explain why t-SNE uses a Student-t distribution in low dimensions (the crowding problem)
recognize strengths/pitfalls (beautiful clusters, dishonest distances)
compare t-SNE vs PCA vs UMAP at a high level
Notation (quick)#
High-dimensional points: \(x_i \in \mathbb{R}^D\)
Low-dimensional embedding: \(y_i \in \mathbb{R}^d\) (usually \(d=2\))
Similarities as probabilities: \(p_{ij}\) in high-D, \(q_{ij}\) in low-D
Table of contents#
Intuition
Probabilistic explanation
Algorithm mechanics (perplexity, optimization, early exaggeration)
Plotly visualizations
effect of perplexity
optimization animation (from scratch)
crowding problem illustration
Strengths & pitfalls
Comparison: t-SNE vs UMAP vs PCA
Exercises
References
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) Intuition: people at a party#
Imagine each data point is a person, and their features are things like music taste, hobbies, and politics.
In the real feature space (\(D\) dimensions), you can measure who is “close” to whom — but you can’t draw it.
t-SNE asks a very specific question:
“For each person, who are their closest friends?”
Then it tries to place the people in 2D so that:
close friends stay close (preserving friendships)
everyone else is allowed to move around as long as they don’t break those friendships (not preserving geography)
That’s why t-SNE plots look like convincing clusters: they’re optimized to make neighborhoods look good.
# Dataset used throughout: handwritten digits (64 features per sample)
digits = load_digits()
X_all = digits.data.astype(float)
y_all = digits.target.astype(int)
images_all = digits.images
# Distance-based methods are sensitive to feature scaling
X_all = StandardScaler().fit_transform(X_all)
# Keep things fast for visualization
n_samples = 900
idx = rng.choice(X_all.shape[0], size=n_samples, replace=False)
X = X_all[idx]
y = y_all[idx]
images = images_all[idx]
# A common practical trick: PCA before t-SNE (denoise + speed)
X_pca = PCA(n_components=50).fit_transform(X)
i = int(rng.integers(0, n_samples))
fig = px.imshow(
images[i],
color_continuous_scale="gray",
title=f"One sample from the dataset (label={y[i]})",
)
fig.update_layout(coloraxis_showscale=False)
fig.update_xaxes(showticklabels=False)
fig.update_yaxes(showticklabels=False)
fig.show()
X.shape, X_pca.shape
((900, 64), (900, 50))
2) Probabilistic explanation#
High-dimensional similarities as probabilities#
Instead of trying to preserve raw distances, t-SNE turns “closeness” into probabilities.
For each point \(x_i\), define the probability that \(x_j\) is picked as its neighbor:

\[
p_{j\mid i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \ne i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \qquad p_{i\mid i} = 0
\]
This is a Gaussian kernel around \(x_i\).
Each point gets its own bandwidth \(\sigma_i\) (chosen via perplexity).
Then we symmetrize into a single joint distribution:

\[
p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2n}
\]
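A minimal sketch of these two steps on four toy 2-D points (with a fixed bandwidth \(\sigma\) here; the real algorithm tunes a per-point \(\sigma_i\) via perplexity, as shown later):
# Toy example: two tight pairs of points, far away from each other
import numpy as np

X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
sigma = 1.0
D2 = np.sum((X_toy[:, None, :] - X_toy[None, :, :]) ** 2, axis=-1)  # squared distances

P_cond = np.exp(-D2 / (2 * sigma**2))
np.fill_diagonal(P_cond, 0.0)                      # p(i | i) = 0
P_cond /= P_cond.sum(axis=1, keepdims=True)        # rows are p(j | i)

P_joint = (P_cond + P_cond.T) / (2 * len(X_toy))   # symmetrized joint p_ij
print(np.round(P_cond, 3))
print(np.round(P_joint, 3))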
Low-dimensional similarities: Gaussian vs Student-t#
In the embedding space, we also define neighbor probabilities \(q_{ij}\).
If we used a Gaussian here too, a pair at moderate distance in 2D would get almost no probability mass, so the only way to match its moderate \(p_{ij}\) would be to pull the pair much closer. Done for every such pair, everything collapses toward the center: the crowding problem.
t-SNE fixes this by using a heavy-tailed Student-t distribution with 1 degree of freedom (a Cauchy kernel):

\[
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \ne l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}, \qquad q_{ii} = 0
\]
Objective: KL divergence#
t-SNE chooses the embedding \(Y\) by minimizing the mismatch between these two distributions:

\[
\mathrm{KL}(P \,\|\, Q) = \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
\]
Key intuition about this KL direction:
If two points are neighbors in high-D (\(p_{ij}\) is big), but far in 2D (\(q_{ij}\) is small), you pay a big penalty.
If two points are far in high-D (tiny \(p_{ij}\)) but close in 2D, the penalty is smaller.
This “asymmetry” is part of why t-SNE creates separated, clean-looking islands.
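A tiny numeric check of that asymmetry (toy probabilities, not computed from the dataset): each pair contributes \(p_{ij} \log (p_{ij}/q_{ij})\) to the objective, and the total is dominated by pairs with large \(p\) and small \(q\).
# Per-pair contribution to KL(P || Q), with made-up toy probabilities
import numpy as np

def pair_cost(p, q):
    return p * np.log(p / q)

# Case A: neighbors in high-D placed far apart in 2D -> big penalty
# Case B: far apart in high-D but placed close in 2D -> tiny contribution
#         (a single term can even be slightly negative; the full sum is always >= 0)
print("A: p=0.01,   q=0.0001 ->", round(pair_cost(0.01, 0.0001), 5))
print("B: p=0.0001, q=0.01   ->", round(pair_cost(0.0001, 0.01), 5))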
3) Algorithm mechanics#
Perplexity (the “friend-circle size”)#
Perplexity is defined from the entropy of the neighbor distribution \(P_i = (p_{j\mid i})_j\):

\[
\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_j p_{j\mid i} \log_2 p_{j\mid i}
\]

(Using natural logs and \(\exp\), as the code below does, gives the same number.)
Interpretation: effective number of neighbors (a quick numeric check follows below).
small perplexity (e.g. 5): focus on very local “best friends”
large perplexity (e.g. 50): wider social circle
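A quick sanity check of the “effective number of neighbors” reading: if \(p(j \mid i)\) were exactly uniform over \(k\) neighbors, the perplexity would be exactly \(k\).
# If the neighbor distribution is uniform over k "friends", exp(entropy) == k exactly
import numpy as np

for k in (5, 30, 50):
    p = np.full(k, 1.0 / k)
    H = -np.sum(p * np.log(p))   # Shannon entropy (nats)
    print(f"k={k:>2}  perplexity={np.exp(H):.1f}")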
Gradient descent#
The objective is non-convex, so:
different random seeds can give different-looking layouts (the two-seed sketch below shows this)
you can trust local neighborhoods more than global geometry
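A small sketch of the seed effect, assuming the X_pca and y arrays defined earlier (init="random" so the seed actually changes the starting layout):
# Sketch: same data, two seeds. The global arrangement can differ;
# local neighborhoods should look similar. Assumes X_pca and y from above.
from sklearn.manifold import TSNE
import plotly.express as px

for seed in (0, 1):
    Z = TSNE(n_components=2, perplexity=30, init="random", random_state=seed).fit_transform(X_pca)
    px.scatter(
        x=Z[:, 0], y=Z[:, 1], color=y.astype(str),
        labels={"x": "t-SNE 1", "y": "t-SNE 2", "color": "digit"},
        title=f"t-SNE with random_state={seed}",
    ).show()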
Early exaggeration#
During the first phase, t-SNE temporarily multiplies \(P\) by a constant (often 12). A story:
“For a while, we pretend friendships are stronger than they really are.”
This makes local attraction stronger early on, helping clusters form before the layout settles.
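A tiny numeric sketch of what that does to the gradient (made-up toy numbers): the pull between a pair of true neighbors is proportional to \(p_{ij} - q_{ij}\), so multiplying \(p_{ij}\) by 12 makes the early attraction much stronger.
# Toy numbers (hypothetical): a pair of true neighbors that currently sits too far apart
p_ij = 0.002    # high-D similarity (relatively large)
q_ij = 0.0005   # current low-D similarity (too small: the pair is misplaced)

for factor in (1.0, 12.0):
    pull = factor * p_ij - q_ij   # the attractive part of the gradient scales with this
    print(f"exaggeration factor {factor:>4}: p - q = {pull:+.4f}")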
def pairwise_squared_distances(X: np.ndarray) -> np.ndarray:
X = np.asarray(X, dtype=float)
sum_sq = np.sum(X**2, axis=1)
D = sum_sq[:, None] + sum_sq[None, :] - 2 * (X @ X.T)
return np.maximum(D, 0.0)
def _Hbeta(dist_sq: np.ndarray, beta: float) -> tuple[float, np.ndarray]:
"""Shannon entropy H and probabilities for a distance vector at precision beta."""
P = np.exp(-dist_sq * beta)
sumP = float(np.sum(P))
if sumP == 0.0:
P = np.full_like(P, 1.0 / P.size)
H = float(np.log(P.size))
return H, P
H = float(np.log(sumP) + beta * np.sum(dist_sq * P) / sumP)
P = P / sumP
return H, P
def conditional_probabilities(
dist_sq_row: np.ndarray,
perplexity: float,
tol: float = 1e-5,
max_iter: int = 60,
) -> tuple[np.ndarray, float]:
"""Binary-search beta so that exp(H) ~= perplexity."""
dist_sq_row = np.asarray(dist_sq_row, dtype=float)
logU = float(np.log(perplexity))
beta = 1.0
beta_min = -np.inf
beta_max = np.inf
for _ in range(max_iter):
H, P = _Hbeta(dist_sq_row, beta)
Hdiff = H - logU
if abs(Hdiff) < tol:
break
if Hdiff > 0:
beta_min = beta
beta = beta * 2.0 if np.isinf(beta_max) else 0.5 * (beta + beta_max)
else:
beta_max = beta
beta = beta / 2.0 if np.isinf(beta_min) else 0.5 * (beta + beta_min)
return P, float(beta)
def perplexity_of(p: np.ndarray) -> float:
p = np.asarray(p, dtype=float)
p = p[p > 0]
H = -np.sum(p * np.log(p))
return float(np.exp(H))
# Demonstration: how perplexity changes the neighbor distribution for ONE point
X_demo = X_pca[:400]
y_demo = y[:400]
D_demo = pairwise_squared_distances(X_demo)
i0 = 0
mask = np.arange(D_demo.shape[0]) != i0
dist_row = D_demo[i0, mask]
perplexities = [5, 30, 80]
fig = go.Figure()
for perp in perplexities:
p_row, beta = conditional_probabilities(dist_row, perplexity=perp)
p_sorted = np.sort(p_row)[::-1]
fig.add_trace(
go.Scatter(
x=np.arange(1, p_sorted.size + 1),
y=p_sorted,
mode="lines",
name=f"target={perp} (actual={perplexity_of(p_row):.1f})",
)
)
fig.update_layout(
title=f"Perplexity as 'friend-circle size' (one point, label={y_demo[i0]})",
xaxis_title="Neighbor rank (sorted)",
yaxis_title="p(j | i)",
)
fig.update_yaxes(type="log")
fig.show()
4) Plotly visualizations#
We’ll focus on three things you can see:
how embeddings change with perplexity
how optimization “pulls neighborhoods together” over time
what the crowding problem is and why Student-t helps
# 4.1 Effect of different perplexities (sklearn t-SNE)
perplexities = [5, 30, 50]
embeddings = {}
for perp in perplexities:
model = TSNE(
n_components=2,
perplexity=perp,
init="pca",
learning_rate="auto",
max_iter=750,
random_state=42,
)
embeddings[perp] = model.fit_transform(X_pca)
fig = make_subplots(
rows=1,
cols=len(perplexities),
subplot_titles=[f"perplexity={p}" for p in perplexities],
)
for col, perp in enumerate(perplexities, start=1):
Z = embeddings[perp]
fig.add_trace(
go.Scatter(
x=Z[:, 0],
y=Z[:, 1],
mode="markers",
marker=dict(
color=y,
colorscale="Turbo",
size=6,
opacity=0.85,
line=dict(width=0.2, color="white"),
showscale=(col == len(perplexities)),
colorbar=dict(title="digit") if col == len(perplexities) else None,
),
text=[f"digit={d}" for d in y],
hovertemplate="%{text}<br>x=%{x:.2f}<br>y=%{y:.2f}<extra></extra>",
showlegend=False,
),
row=1,
col=col,
)
fig.update_xaxes(title_text="t-SNE 1", row=1, col=col)
fig.update_yaxes(title_text="t-SNE 2" if col == 1 else "", row=1, col=col)
fig.update_layout(
title="t-SNE changes with perplexity (same data, different 'friend-circle size')",
height=370,
width=1100,
)
fig.show()
4.2 Animation of optimization (from scratch)#
To animate optimization, we’ll run a small from-scratch t-SNE loop on a subset. This is not meant to be the fastest implementation; it’s meant to make the mechanics visible.
def compute_joint_probabilities(
X: np.ndarray,
perplexity: float,
tol: float = 1e-5,
max_iter: int = 60,
eps: float = 1e-12,
) -> np.ndarray:
X = np.asarray(X, dtype=float)
n = X.shape[0]
D = pairwise_squared_distances(X)
P = np.zeros((n, n), dtype=float)
for i in range(n):
mask = np.arange(n) != i
p_row, _ = conditional_probabilities(D[i, mask], perplexity=perplexity, tol=tol, max_iter=max_iter)
P[i, mask] = p_row
P = (P + P.T) / (2.0 * n)
np.fill_diagonal(P, 0.0)
P = np.maximum(P, eps)
P = P / np.sum(P)
return P
def kl_divergence(P: np.ndarray, Q: np.ndarray, eps: float = 1e-12) -> float:
P = np.asarray(P, dtype=float)
Q = np.asarray(Q, dtype=float)
return float(np.sum(P * (np.log(P + eps) - np.log(Q + eps))))
def optimize_embedding(
P: np.ndarray,
*,
Y_init: np.ndarray | None = None,
n_iter: int = 350,
learning_rate: float = 200.0,
momentum_early: float = 0.5,
momentum_late: float = 0.8,
momentum_switch_iter: int = 250,
early_exaggeration: float = 12.0,
exaggeration_iters: int = 120,
save_every: int = 10,
kernel: str = "student_t",
seed: int = 42,
eps: float = 1e-12,
) -> tuple[list[tuple[int, np.ndarray]], np.ndarray]:
"""Optimize Y with either t-SNE (Student-t) or symmetric SNE (Gaussian)."""
P = np.asarray(P, dtype=float)
n = P.shape[0]
if kernel not in {"student_t", "gaussian"}:
raise ValueError("kernel must be 'student_t' or 'gaussian'")
if Y_init is None:
local_rng = np.random.default_rng(seed)
Y = local_rng.normal(0.0, 1e-4, size=(n, 2))
else:
Y = np.asarray(Y_init, dtype=float).copy()
iY = np.zeros_like(Y)
gains = np.ones_like(Y)
min_gain = 0.01
history: list[tuple[int, np.ndarray]] = []
kls = np.zeros(n_iter, dtype=float)
for it in range(n_iter):
P_use = P * (early_exaggeration if it < exaggeration_iters else 1.0)
sum_Y = np.sum(Y**2, axis=1)
D = np.maximum(sum_Y[:, None] + sum_Y[None, :] - 2.0 * (Y @ Y.T), 0.0)
if kernel == "student_t":
num = 1.0 / (1.0 + D)
else:
num = np.exp(-D)
np.fill_diagonal(num, 0.0)
        Q = num / max(np.sum(num), eps)  # guard: with the Gaussian kernel, all similarities can underflow to 0
        Q = np.maximum(Q, eps)
if kernel == "student_t":
PQ = (P_use - Q) * num
else:
PQ = P_use - Q
dY = 4.0 * (Y * PQ.sum(axis=1)[:, None] - PQ @ Y)
gains = (gains + 0.2) * ((dY > 0) != (iY > 0)) + (gains * 0.8) * ((dY > 0) == (iY > 0))
gains = np.maximum(gains, min_gain)
momentum = momentum_early if it < momentum_switch_iter else momentum_late
iY = momentum * iY - learning_rate * gains * dY
Y = Y + iY
Y = Y - np.mean(Y, axis=0)
kls[it] = kl_divergence(P, Q, eps=eps)
if it % save_every == 0 or it == n_iter - 1:
history.append((it, Y.copy()))
return history, kls
n_anim = 320
idx_anim = rng.choice(X_pca.shape[0], size=n_anim, replace=False)
X_anim = X_pca[idx_anim]
y_anim = y[idx_anim]
perplexity_anim = 30
P_anim = compute_joint_probabilities(X_anim, perplexity=perplexity_anim)
Y0 = rng.normal(0.0, 1e-4, size=(n_anim, 2))
history_tsne, kl_tsne = optimize_embedding(
P_anim,
Y_init=Y0,
n_iter=360,
save_every=10,
exaggeration_iters=120,
early_exaggeration=12.0,
kernel="student_t",
)
fig = px.line(
x=np.arange(kl_tsne.size),
y=kl_tsne,
title="From-scratch t-SNE: KL divergence over iterations",
labels={"x": "iteration", "y": "KL(P || Q)"},
)
fig.add_vline(
x=120,
line_dash="dash",
line_color="black",
annotation_text="early exaggeration ends",
)
fig.show()
# Animation: embedding positions as optimization progresses
frames = []
steps = []
marker = dict(
color=y_anim,
colorscale="Turbo",
size=7,
opacity=0.85,
line=dict(width=0.25, color="white"),
showscale=True,
colorbar=dict(title="digit"),
)
for it, Y in history_tsne:
frames.append(
go.Frame(
data=[
go.Scatter(
x=Y[:, 0],
y=Y[:, 1],
mode="markers",
marker=marker,
text=[f"digit={d}" for d in y_anim],
hovertemplate="%{text}<br>x=%{x:.2f}<br>y=%{y:.2f}<extra></extra>",
showlegend=False,
)
],
name=str(it),
layout=go.Layout(title_text=f"t-SNE optimization (from scratch) — iteration {it}"),
)
)
steps.append(
dict(
method="animate",
args=[[str(it)], {"mode": "immediate", "frame": {"duration": 300, "redraw": True}, "transition": {"duration": 0}}],
label=str(it),
)
)
it0, Y_first = history_tsne[0]
fig = go.Figure(
data=[
go.Scatter(
x=Y_first[:, 0],
y=Y_first[:, 1],
mode="markers",
marker=marker,
text=[f"digit={d}" for d in y_anim],
hovertemplate="%{text}<br>x=%{x:.2f}<br>y=%{y:.2f}<extra></extra>",
showlegend=False,
)
],
layout=go.Layout(
title=f"t-SNE optimization (from scratch) — iteration {it0}",
xaxis=dict(title="y1", zeroline=False),
yaxis=dict(title="y2", zeroline=False, scaleanchor="x", scaleratio=1),
height=520,
width=720,
updatemenus=[
dict(
type="buttons",
showactive=False,
y=1.05,
x=1.0,
xanchor="right",
buttons=[
dict(
label="Play",
method="animate",
args=[None, {"fromcurrent": True, "frame": {"duration": 300, "redraw": True}, "transition": {"duration": 0}}],
),
dict(
label="Pause",
method="animate",
args=[[None], {"mode": "immediate", "frame": {"duration": 0, "redraw": False}, "transition": {"duration": 0}}],
),
],
)
],
sliders=[
dict(
active=0,
currentvalue={"prefix": "iteration="},
pad={"t": 50},
steps=steps,
)
],
),
frames=frames,
)
fig.show()
4.3 Crowding problem illustration#
Why does t-SNE use Student-t in 2D?
A geometric story:
In high dimensions, there’s “room” for many points to be at moderate distances.
In 2D, if you try to keep all those moderate distances, points run out of room and get crowded.
Using a heavy tail makes it easier to push moderately distant points farther apart.
# Gaussian vs Student-t tails (shape comparison)
d = np.linspace(0, 10, 400)
sigma = 1.0
gaussian = np.exp(-(d**2) / (2 * sigma**2))
student_t = 1.0 / (1.0 + d**2)
fig = make_subplots(rows=1, cols=2, subplot_titles=["Linear scale", "Log scale (tails matter)"])
fig.add_trace(go.Scatter(x=d, y=gaussian, mode="lines", name="Gaussian"), row=1, col=1)
fig.add_trace(go.Scatter(x=d, y=student_t, mode="lines", name="Student-t (df=1)"), row=1, col=1)
fig.add_trace(go.Scatter(x=d, y=gaussian, mode="lines", showlegend=False), row=1, col=2)
fig.add_trace(go.Scatter(x=d, y=student_t, mode="lines", showlegend=False), row=1, col=2)
fig.update_xaxes(title_text="distance", row=1, col=1)
fig.update_yaxes(title_text="unnormalized similarity", row=1, col=1)
fig.update_xaxes(title_text="distance", row=1, col=2)
fig.update_yaxes(type="log", title_text="unnormalized similarity (log)", row=1, col=2)
fig.update_layout(title="Heavy tails give distant points more 'influence' (less crowding)")
fig.show()
# Same P, same init, different low-D kernel: symmetric SNE (Gaussian) vs t-SNE (Student-t)
history_gauss, kl_gauss = optimize_embedding(
P_anim,
Y_init=Y0,
n_iter=360,
save_every=360,
exaggeration_iters=120,
early_exaggeration=12.0,
kernel="gaussian",
)
Y_gauss = history_gauss[-1][1]
Y_tsne = history_tsne[-1][1]
fig = make_subplots(rows=1, cols=2, subplot_titles=["Gaussian low-D kernel (crowding)", "Student-t low-D kernel (t-SNE)"])
for col, Z in enumerate([Y_gauss, Y_tsne], start=1):
fig.add_trace(
go.Scatter(
x=Z[:, 0],
y=Z[:, 1],
mode="markers",
marker=dict(color=y_anim, colorscale="Turbo", size=6, opacity=0.85, line=dict(width=0.2, color="white"), showscale=(col == 2), colorbar=dict(title="digit") if col == 2 else None),
showlegend=False,
),
row=1,
col=col,
)
fig.update_xaxes(title_text="dim 1", row=1, col=col)
fig.update_yaxes(title_text="dim 2" if col == 1 else "", row=1, col=col, scaleanchor="x", scaleratio=1)
fig.update_layout(title="Crowding: Gaussian vs Student-t in the low-dimensional space", height=420, width=1000)
fig.show()
5) Strengths & pitfalls#
Strengths#
Great for local structure: who is similar to whom.
Great for exploration: clusters, subclusters, outliers.
Often reveals nonlinear manifolds that PCA can’t.
Pitfalls (the important ones)#
Clusters look convincing but distances lie: the spacing between clusters is not trustworthy.
Non-parametric: standard t-SNE doesn’t learn a function you can apply to new points; you typically rerun it.
Sensitive knobs: perplexity, learning rate, early exaggeration, random seed.
Not a modeling tool: it’s for visualization/diagnostics, not for predictive performance.
A practical checklist:
try multiple seeds and perplexities
don’t interpret global distances
validate clusters with labels/metadata when you can (a simple neighborhood-preservation check is sketched below)
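A rough sketch of such a check, assuming X_pca and the embeddings dictionary from the perplexity sweep above: compute the fraction of each point’s \(k\) nearest high-dimensional neighbors that are still among its \(k\) nearest neighbors in 2D.
# Fraction of k nearest neighbors preserved by each embedding (higher = more faithful locally)
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_overlap(X_high, X_low, k=10):
    # kneighbors() on the fitted data excludes each point itself
    nbrs_high = NearestNeighbors(n_neighbors=k).fit(X_high).kneighbors(return_distance=False)
    nbrs_low = NearestNeighbors(n_neighbors=k).fit(X_low).kneighbors(return_distance=False)
    per_point = [len(set(a) & set(b)) / k for a, b in zip(nbrs_high, nbrs_low)]
    return float(np.mean(per_point))

for perp, Z in embeddings.items():
    print(f"perplexity={perp:>2}  mean 10-NN overlap = {knn_overlap(X_pca, Z, k=10):.2f}")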
6) Comparison: t-SNE vs UMAP vs PCA#
| Method | What it tries to preserve | Strengths | Typical pitfalls |
|---|---|---|---|
| PCA | global variance (linear) | fast, stable, interpretable | misses nonlinear structure |
| t-SNE | local neighborhoods (probabilistic) | excellent cluster visualization | global geometry not meaningful; slow; non-parametric |
| UMAP | local neighborhoods + more global structure (graph) | fast; often preserves more global structure; can transform new points | hyperparameters matter; can also make “convincing” clusters |
Rule of thumb:
Start with PCA (baseline, sanity check).
Use t-SNE when you want the cleanest view of local neighborhoods.
Use UMAP when you want speed + a bit more global structure (and often an easier path to mapping new points).
# Visual comparison on the same dataset
Z_pca2 = PCA(n_components=2).fit_transform(X_pca)
if "embeddings" in globals() and 30 in embeddings:
Z_tsne30 = embeddings[30]
else:
Z_tsne30 = TSNE(
n_components=2,
perplexity=30,
init="pca",
learning_rate="auto",
max_iter=750,
random_state=42,
).fit_transform(X_pca)
Z_umap = None
try:
import umap
Z_umap = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(X_pca)
except Exception as e:
print("UMAP is optional. Install with: pip install umap-learn")
print(f"Reason: {e}")
embeds = [("PCA", Z_pca2), ("t-SNE (perp=30)", Z_tsne30)]
if Z_umap is not None:
embeds.append(("UMAP", Z_umap))
fig = make_subplots(rows=1, cols=len(embeds), subplot_titles=[name for name, _ in embeds])
for col, (name, Z) in enumerate(embeds, start=1):
fig.add_trace(
go.Scatter(
x=Z[:, 0],
y=Z[:, 1],
mode="markers",
marker=dict(
color=y,
colorscale="Turbo",
size=6,
opacity=0.85,
line=dict(width=0.2, color="white"),
showscale=(col == len(embeds)),
colorbar=dict(title="digit") if col == len(embeds) else None,
),
showlegend=False,
),
row=1,
col=col,
)
fig.update_xaxes(title_text="dim 1", row=1, col=col)
fig.update_yaxes(title_text="dim 2" if col == 1 else "", row=1, col=col)
fig.update_layout(title="PCA vs t-SNE vs (optional) UMAP on digits", height=380, width=1100)
fig.show()
UMAP is optional. Install with: pip install umap-learn
Reason: No module named 'umap'
Exercises#
Run the perplexity sweep with values 2, 10, 100. What changes? What stays stable?
Pick a single point and look at its nearest neighbors in the original space (by Euclidean distance). Do they stay neighbors in t-SNE?
Change the random seed and rerun the from-scratch optimizer. Which conclusions remain valid?
Try initializing with PCA for the from-scratch implementation. Does it converge faster?
Compare t-SNE and UMAP on a different dataset (e.g. make_moons, make_swiss_roll).
References#
van der Maaten & Hinton (2008): Visualizing Data using t-SNE
van der Maaten (2014): Accelerating t-SNE using Tree-Based Algorithms (Barnes-Hut t-SNE)
McInnes et al. (2018): UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
scikit-learn documentation:
sklearn.manifold.TSNE